Lowering costs with Spot Instances (BETA)
Interruptible machines
Cloud machines are normally expensive. However, if your job can tolerate being interrupted at any time (e.g., fine-tuning, or training a model that can be restarted from a checkpoint), you can use spot instances in Grid to lower training and development costs.
Enable Spot Instances via the UI
Enable Spot Instances via the CLI
grid run --use_spot pl_mnist.py
Prepare code for interruption
To take advantage of interruptible machines, make sure that:
- Your code saves checkpoints or any other state it needs. Grid automatically picks these up into your artifacts.
- Your code can be restarted from a checkpoint or state file.
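The two requirements above can be sketched in plain Python. This is a minimal illustration, not Grid's own API: the function names, the JSON state format, and the default checkpoint location are all assumptions made for the example, and the "training" is a stand-in for real work. Only the --ck_path argument name follows the example used on this page.

```python
import argparse
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if path and os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "total": 0.0}


def train(ck_path, num_epochs, stop_after=None):
    """Run (or resume) a toy training loop, checkpointing after every epoch.

    stop_after is a hypothetical knob used only to simulate an interruption.
    """
    state = load_checkpoint(ck_path)
    for epoch in range(state["epoch"], num_epochs):
        if stop_after is not None and epoch >= stop_after:
            break  # simulated spot interruption
        state["total"] += epoch  # stand-in for one epoch of real training
        state["epoch"] = epoch + 1
        save_checkpoint(ck_path, state)
    return state


def main(argv=None):
    parser = argparse.ArgumentParser()
    # Default location is an assumption; use whatever path your job writes to.
    parser.add_argument("--ck_path", default="checkpoints/state.json")
    parser.add_argument("--epochs", type=int, default=5)
    args = parser.parse_args(argv)
    return train(args.ck_path, args.epochs)
```

Saving after every epoch bounds the lost work to at most one epoch, and writing to a temporary file before renaming means a half-written checkpoint never replaces a good one. A real script would call main() from its entry point.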
Restarting interrupted jobs
Once the machine is interrupted, your job on Grid will stop. To continue running your code, do the following:
- Navigate to your experiment artifacts.
- Copy the link to the state files (or checkpoint) that you need.
- Resubmit the job with the path to that file.
For example, assume your script has an argument called --ck_path:
grid run --use_spot main.py --ck_path https://grid.ai/url/to/checkpoint.ckpt
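Because the value passed to --ck_path in this example is a URL, the script itself has to turn it into a local file before loading it. The helper below is a hedged sketch of one way to do that with the Python standard library. The function name resolve_checkpoint and the fallback file name are hypothetical, and the sketch assumes the artifact link can be downloaded directly without extra authentication, which may not hold for your account.

```python
import os
import urllib.parse
import urllib.request


def resolve_checkpoint(ck_path, download_dir="."):
    """Return a local file path for ck_path, downloading it first if it is a URL."""
    parsed = urllib.parse.urlparse(ck_path)
    if parsed.scheme not in ("http", "https"):
        return ck_path  # already a local path, use it unchanged
    os.makedirs(download_dir, exist_ok=True)
    # Fall back to a generic name if the URL path has no file component.
    filename = os.path.basename(parsed.path) or "checkpoint.ckpt"
    local_path = os.path.join(download_dir, filename)
    # Assumption: the link is publicly downloadable with a plain GET request.
    urllib.request.urlretrieve(ck_path, local_path)
    return local_path
```

Calling resolve_checkpoint(args.ck_path) at the start of the script lets the same argument accept either a local file or a copied artifact link.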
note
If you have additional questions about Runs, visit the FAQ. The section is periodically updated with common questions from the Grid community.